Load the data. Adjust the file path below to match your local copy.

reviews2.csv <- read.csv('~/Dropbox/Eugenie/data/arslan-reviews2.csv')

Convert the numeric indicator columns to factors.

## turn numeric values to factors
reviews2.csv$is_deleted <- as.factor(reviews2.csv$is_deleted)
reviews2.csv$incentivized <- as.factor(reviews2.csv$incentivized)
reviews2.csv$verified_purchaser <- as.factor(reviews2.csv$verified_purchaser)

## relabel the factor levels (assignment is positional, assuming the levels sort as 0, 1)
levels(reviews2.csv$verified_purchaser) <- c("unverified", "verified")
levels(reviews2.csv$incentivized) <- c("non-incentivized", "incentivized")
levels(reviews2.csv$is_deleted) <- c("kept", "deleted")

1 Data processing

## get relevant columns
cols <- c('recid', 'item_id', 'user_id', 'text')
reviews2.text <- as.data.frame(reviews2.csv[, cols])

## turn numeric values to factors
reviews2.text$recid <- as.factor(reviews2.text$recid)

## turn factors to char vectors for tidy unnest_tokens
reviews2.text$text <- as.character(reviews2.text$text)

## get tidy tokens for each review record
tidy.reviews2.text <- reviews2.text %>% 
  unnest_tokens(word, text)

## remove stop words
data(stop_words)
tidy.reviews2.text <- tidy.reviews2.text %>% 
  anti_join(stop_words)
## Joining, by = "word"

Figure (not shown): words with over 20,000 appearances across the reviews as a whole.
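The frequency plot itself is not reproduced in this export. Assuming the tidy.reviews2.text tokens built above, the words behind it could be listed with a dplyr count along these lines (a sketch; the 20,000 cutoff is taken from the caption):

```r
library(dplyr)

## count each word's total appearances across all reviews
## and keep only the very common ones
frequent.words <- tidy.reviews2.text %>%
  count(word, sort = TRUE) %>%
  filter(n > 20000)
frequent.words
```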

2 Sentiment Analysis with Three Lexicons

2.1 Sentiment Analysis with Afinn Lexicon

2.1.1 Calculate Afinn Sentiment Score and Index

## inner join with afinn sentiments
afinn.reviews2 <- tidy.reviews2.text %>%
  inner_join(get_sentiments('afinn')) 
## Joining, by = "word"
## sum the sentiment of words by record
afinn.reviews2 <- afinn.reviews2 %>%
  group_by(recid) %>%
  mutate(word.count=n()) %>%
  mutate(afinn.sentiment=sum(value)) %>%
  mutate(method='AFINN')

Compute an index ranging from -1 to 1 using:

  1. word count per review

  2. rescale function from package scales

## scale to -1 to 1 index, afinn sentiment is calculated on a -5 to 5 scale
afinn.reviews2 <- afinn.reviews2 %>%
  mutate(afinn.index=(afinn.sentiment/word.count)/5)

## scale to -1 to 1 index with rescale function from package scales
afinn.reviews2$afinn.sentiment.std <- rescale(afinn.reviews2$afinn.sentiment, to=c(-1,1))

Left join the Afinn columns to selected columns of the original reviews2.csv data frame.

reviews2.sentiment <- merge(reviews2.csv[,c('recid', 'item_id', 'rating', 'incentivized', 'is_deleted', 'verified_purchaser', 'text', 'title')], 
                            afinn.reviews2[,c('recid', 'afinn.sentiment', 'afinn.index', 'afinn.sentiment.std')], by='recid', all.x = T)

Observe missing values: 86,205 records have no Afinn sentiment index.

summary(reviews2.sentiment$afinn.index)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -1.00   -0.03    0.20    0.16    0.40    1.00   86205
reviews2.sentiment[, c('incentivized','afinn.index','afinn.sentiment')] %>%
  group_by(incentivized) %>%
  summarize_all(mean, na.rm = TRUE)
## # A tibble: 2 x 3
##   incentivized     afinn.index afinn.sentiment
##   <fct>                  <dbl>           <dbl>
## 1 non-incentivized       0.159            2.49
## 2 incentivized           0.203           11.9

The summary statistics are quite surprising. I wonder whether I made a mistake somewhere (e.g., when normalizing the sentiment scores to an index):

  • There is a big difference in the Afinn sentiment score between non-incentivized and incentivized reviews

  • However, the difference is far less pronounced in the normalized index between the two groups
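One way to probe the discrepancy (a sketch against the objects built above) is to check whether incentivized reviews simply contain more sentiment-bearing words, which would inflate the raw sum but not the per-word index. The recid conversion below is needed because recid is a factor in afinn.reviews2 but numeric in reviews2.csv:

```r
library(dplyr)

## one row per review, joined back to the incentivized flag
afinn.reviews2 %>%
  distinct(recid, .keep_all = TRUE) %>%
  ungroup() %>%
  mutate(recid = as.numeric(as.character(recid))) %>%
  left_join(reviews2.csv[, c('recid', 'incentivized')], by = 'recid') %>%
  group_by(incentivized) %>%
  summarize(mean.sentiment.words = mean(word.count))
```

If the mean word counts differ by roughly the same factor as the raw scores, the gap in afinn.sentiment is largely a review-length effect, which is exactly what the per-word afinn.index controls for.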

2.1.2 Plots

Remove records with any NA, so only ~60% of the data is available for plotting

reviews2.sentiment.all <- na.omit(reviews2.sentiment)

Boxplot: Afinn sentiment score vs. rating
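The boxplot is not embedded in this export; assuming ggplot2 and the columns above, it could be redrawn along these lines:

```r
library(ggplot2)

## Afinn sentiment score by star rating (rating treated as discrete)
ggplot(reviews2.sentiment.all, aes(x = factor(rating), y = afinn.sentiment)) +
  geom_boxplot() +
  labs(x = 'rating', y = 'Afinn sentiment score')
```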

This record is the extreme outlier in the boxplot. The review text has 2,476 words, the longest in the dataset. It has a 5-star rating yet receives a low sentiment index under all three lexicons; it may need additional investigation/processing.

reviews2.csv[reviews2.csv$recid == 35676004, c('recid', 'item_id', 'rating', 'incentivized', 'is_deleted', 'verified_purchaser', 'title', 'word_count')]
##          recid    item_id rating incentivized is_deleted
## 22625 35676004 B01422TC14      5 incentivized    deleted
##       verified_purchaser
## 22625         unverified
##                                                                                      title
## 22625 Something for the weekend sir?. A 79.01%* efficient power bank that keeps on giving.
##       word_count
## 22625       2476

Boxplot: Afinn sentiment index vs. rating

2.1.3 Linear Model

Use the Afinn sentiment score in place of the rating

# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- afinn.sentiment ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within", 
##     index = c("item_id"))
## 
## Unbalanced Panel: n = 101, T = 18-20532, N = 411039
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -144.697026   -3.173747   -0.058654    2.754540   44.176404 
## 
## Coefficients:
##                             Estimate Std. Error t-value  Pr(>|t|)    
## incentivizedincentivized    6.706639   0.059600 112.528 < 2.2e-16 ***
## is_deleteddeleted           1.201850   0.036068  33.322 < 2.2e-16 ***
## verified_purchaserverified -1.614790   0.034614 -46.652 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    15475000
## Residual Sum of Squares: 14200000
## R-Squared:      0.082417
## Adj. R-Squared: 0.082187
## F-statistic: 12303.4 on 3 and 410935 DF, p-value: < 2.22e-16

Use the Afinn sentiment index in place of the rating

# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- afinn.index ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within", 
##     index = c("item_id"))
## 
## Unbalanced Panel: n = 101, T = 18-20532, N = 411039
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -1.2398660 -0.1846995  0.0096222  0.2053119  1.0659298 
## 
## Coefficients:
##                             Estimate Std. Error t-value Pr(>|t|)    
## incentivizedincentivized   0.0325484  0.0030401 10.7064  < 2e-16 ***
## is_deleteddeleted          0.0230638  0.0018397 12.5364  < 2e-16 ***
## verified_purchaserverified 0.0044500  0.0017656  2.5204  0.01172 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    36993
## Residual Sum of Squares: 36946
## R-Squared:      0.0012541
## Adj. R-Squared: 0.0010037
## F-statistic: 171.994 on 3 and 410935 DF, p-value: < 2.22e-16

2.2 Sentiment Analysis with Bing Lexicon

2.2.1 Calculate Bing Sentiment Score and Index

## inner join with bing sentiments
bing.reviews2 <- tidy.reviews2.text %>%
  inner_join(get_sentiments('bing')) 
## Joining, by = "word"
## get sentiments of records by counting the number of positive vs. negative words per record
bing.reviews2 <- bing.reviews2 %>%
  group_by(recid) %>%
  summarise(positive.count=sum(sentiment=='positive'),
            negative.count=sum(sentiment=='negative'))

bing.reviews2 <- bing.reviews2 %>%
  mutate(bing.sentiment=positive.count-negative.count) %>%
  mutate(word.count=positive.count+negative.count) %>%
  mutate(method='BING')

Compute an index ranging from -1 to 1 using:

  1. word count per review. Bing sentiment is calculated on a binary (positive/negative) scale

  2. rescale function from package scales

## scale to -1 to 1 index based on word count per review
bing.reviews2 <- bing.reviews2 %>%
  mutate(bing.index=bing.sentiment/word.count)

## scale to -1 to 1 index
bing.reviews2$bing.sentiment.std <- rescale(bing.reviews2$bing.sentiment, to=c(-1,1))
## note: outliers compress the rescaled values, making this standardization less helpful
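The comment above deserves a concrete illustration. scales::rescale is a linear map onto the target range, so a single extreme score drags the endpoints with it. A hand-rolled equivalent (an illustrative sketch, not the package's implementation) makes this visible:

```r
## linear rescale onto [-1, 1], same idea as scales::rescale()
rescale_lin <- function(x, to = c(-1, 1)) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * (to[2] - to[1]) + to[1]
}

## without an outlier the scores spread over the whole target range
rescale_lin(c(-2, 0, 2))        # -1  0  1

## one extreme review compresses everything else toward the bottom
rescale_lin(c(-2, 0, 2, 100))   # roughly -1.00 -0.96 -0.92  1.00
```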

Left join the Bing columns to selected columns of the original reviews2.csv data frame.

## left join
reviews2.sentiment <- merge(reviews2.sentiment, bing.reviews2[,c('recid', 'bing.sentiment', 'bing.index', 'bing.sentiment.std')], by='recid', all.x = T)

## deduplicate
reviews2.sentiment <- reviews2.sentiment[!duplicated(reviews2.sentiment), ]

Observe missing values: 72,728 records have no Bing sentiment index.

summary(reviews2.sentiment$bing.index)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -1.00    0.00    1.00    0.45    1.00    1.00   72728
reviews2.sentiment[, c('incentivized','bing.index','bing.sentiment')] %>%
  group_by(incentivized) %>%
  summarize_all(mean, na.rm = TRUE)
## # A tibble: 2 x 3
##   incentivized     bing.index bing.sentiment
##   <fct>                 <dbl>          <dbl>
## 1 non-incentivized      0.448          0.911
## 2 incentivized          0.560          5.47

There is a big difference in both the Bing sentiment score and the index between non-incentivized and incentivized reviews.

2.2.2 Plots

Remove records with any NA, so only ~60% of the data is available for plotting

reviews2.sentiment.all <- na.omit(reviews2.sentiment)

Boxplot: Bing sentiment score vs. rating

Boxplot: Bing sentiment index vs. rating

2.2.3 Linear Model

Use the Bing sentiment score in place of the rating

# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- bing.sentiment ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within", 
##     index = c("item_id"))
## 
## Unbalanced Panel: n = 101, T = 10-6697, N = 167323
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -19.26536  -0.84110   0.16953   0.97295  40.11604 
## 
## Coefficients:
##                             Estimate Std. Error t-value  Pr(>|t|)    
## incentivizedincentivized    3.920282   0.050193  78.104 < 2.2e-16 ***
## is_deleteddeleted           0.339129   0.019828  17.104 < 2.2e-16 ***
## verified_purchaserverified -0.300151   0.019834 -15.133 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    662620
## Residual Sum of Squares: 625440
## R-Squared:      0.056123
## Adj. R-Squared: 0.055542
## F-statistic: 3314.29 on 3 and 167219 DF, p-value: < 2.22e-16

Use the Bing sentiment index in place of the rating

# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- bing.index ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within", 
##     index = c("item_id"))
## 
## Unbalanced Panel: n = 101, T = 10-6697, N = 167323
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -1.71628 -0.44976  0.37199  0.50104  1.20615 
## 
## Coefficients:
##                             Estimate Std. Error t-value  Pr(>|t|)    
## incentivizedincentivized   0.1207862  0.0176618  6.8389 8.010e-12 ***
## is_deleteddeleted          0.0559708  0.0069769  8.0223 1.044e-15 ***
## verified_purchaserverified 0.0542145  0.0069790  7.7682 8.003e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    77520
## Residual Sum of Squares: 77439
## R-Squared:      0.0010419
## Adj. R-Squared: 0.00042656
## F-statistic: 58.1347 on 3 and 167219 DF, p-value: < 2.22e-16

2.3 Sentiment Analysis with Loughran Lexicon

2.3.1 Calculate Loughran Sentiment Score and Index

loughran.pos.neg <- get_sentiments("loughran") %>% 
  filter(sentiment %in% c("positive", "negative"))

## inner join with loughran sentiments
loughran.reviews2 <- tidy.reviews2.text %>%
  inner_join(loughran.pos.neg)
## Joining, by = "word"
## get sentiments of records by counting the number of positive vs. negative words per record
loughran.reviews2 <- loughran.reviews2 %>%
  group_by(recid) %>%
  summarise(positive.count=sum(sentiment=='positive'),
            negative.count=sum(sentiment=='negative'))

loughran.reviews2 <- loughran.reviews2 %>%
  mutate(loughran.sentiment=positive.count-negative.count) %>%
  mutate(word.count=positive.count+negative.count) %>%
  mutate(method='LOUGHRAN')

Compute an index ranging from -1 to 1 using:

  1. word count per review. Loughran sentiment is calculated on a binary (positive/negative) scale

  2. rescale function from package scales

## scale to -1 to 1 index based on word count per review
loughran.reviews2 <- loughran.reviews2 %>%
  mutate(loughran.index=loughran.sentiment/word.count)

## scale to -1 to 1 index
loughran.reviews2$loughran.sentiment.std <- rescale(loughran.reviews2$loughran.sentiment, to=c(-1,1))
## note: outliers compress the rescaled values, making this standardization less helpful

Left join the Loughran columns to selected columns of the original reviews2.csv data frame.

## left join
reviews2.sentiment <- merge(reviews2.sentiment, loughran.reviews2[,c('recid', 'loughran.sentiment', 'loughran.index', 'loughran.sentiment.std')], by='recid', all.x = T)

## deduplicate
reviews2.sentiment <- reviews2.sentiment[!duplicated(reviews2.sentiment), ]

Observe missing values: 149,126 records have no Loughran sentiment index, roughly twice as many NAs as under the Bing and Afinn lexicons.

summary(reviews2.sentiment$loughran.index)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -1.00   -1.00    1.00    0.23    1.00    1.00  149126
reviews2.sentiment[, c('incentivized','loughran.index','loughran.sentiment')] %>%
  group_by(incentivized) %>%
  summarize_all(mean, na.rm = TRUE)
## # A tibble: 2 x 3
##   incentivized     loughran.index loughran.sentiment
##   <fct>                     <dbl>              <dbl>
## 1 non-incentivized          0.233              0.234
## 2 incentivized              0.237              0.659

There is only a slight difference in the Loughran index between non-incentivized and incentivized reviews, although the raw Loughran score still differs noticeably.

2.3.2 Plots

Remove records with any NA, so only ~45% of the data is available for plotting

reviews2.sentiment.all <- na.omit(reviews2.sentiment)

Boxplot: Loughran sentiment score vs. rating

Boxplot: Loughran sentiment index vs. rating

2.3.3 Linear Model

Use the Loughran sentiment score in place of the rating

# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- loughran.sentiment ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within", 
##     index = c("item_id"))
## 
## Unbalanced Panel: n = 101, T = 5-4736, N = 107208
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -23.46814  -1.03705   0.43128   0.84801  11.21556 
## 
## Coefficients:
##                            Estimate Std. Error t-value  Pr(>|t|)    
## incentivizedincentivized   0.242001   0.040223  6.0165 1.788e-09 ***
## is_deleteddeleted          0.158052   0.018247  8.6616 < 2.2e-16 ***
## verified_purchaserverified 0.019979   0.017708  1.1283    0.2592    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    211700
## Residual Sum of Squares: 211360
## R-Squared:      0.0015982
## Adj. R-Squared: 0.00063803
## F-statistic: 57.1484 on 3 and 107104 DF, p-value: < 2.22e-16

Use the Loughran sentiment index in place of the rating

# in model.fe, index = c('item_id') defines 'item_id' as the entity
formula.fe <- loughran.index ~ incentivized + is_deleted + verified_purchaser
model.fe <- plm(data = reviews2.sentiment.all, formula = formula.fe, index = c('item_id'), model = 'within')
# get the model summary
summary(model.fe)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = formula.fe, data = reviews2.sentiment.all, model = "within", 
##     index = c("item_id"))
## 
## Unbalanced Panel: n = 101, T = 5-4736, N = 107208
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -1.76389 -0.96487  0.48665  0.69911  1.50394 
## 
## Coefficients:
##                             Estimate Std. Error t-value  Pr(>|t|)    
## incentivizedincentivized   -0.032729   0.024253 -1.3495    0.1772    
## is_deleteddeleted           0.072872   0.011003  6.6232 3.531e-11 ***
## verified_purchaserverified  0.068195   0.010677  6.3870 1.698e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    76899
## Residual Sum of Squares: 76842
## R-Squared:      0.00074249
## Adj. R-Squared: -0.00021847
## F-statistic: 26.5277 on 3 and 107104 DF, p-value: < 2.22e-16

2.4 Missing Sentiment Scores

2.4.1 Records with Missing Scores in All Three Lexicons

Observe that about 23% of all records are missing sentiment scores under all three lexicons (TRUE indicates a missing score):

##   afinn.sentiment bing.sentiment loughran.sentiment   Freq   Freq_pct
## 1           FALSE          FALSE              FALSE 107208 40.5281880
## 2            TRUE          FALSE              FALSE   4989  1.8860079
## 3           FALSE           TRUE              FALSE   2168  0.8195761
## 4            TRUE           TRUE              FALSE   1036  0.3916424
## 5           FALSE          FALSE               TRUE  60115 22.7254685
## 6            TRUE          FALSE               TRUE  19487  7.3667338
## 7           FALSE           TRUE               TRUE   8831  3.3384116
## 8            TRUE           TRUE               TRUE  60693 22.9439717
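The cross-tabulation above can be reproduced (a sketch, assuming reviews2.sentiment now carries all three score columns) from the per-lexicon missingness indicators; as.data.frame(table(...)) yields exactly this combination layout, with the first column varying fastest:

```r
## TRUE = the score is missing for that lexicon
na.flags <- data.frame(
  afinn.sentiment    = is.na(reviews2.sentiment$afinn.sentiment),
  bing.sentiment     = is.na(reviews2.sentiment$bing.sentiment),
  loughran.sentiment = is.na(reviews2.sentiment$loughran.sentiment)
)

## count every TRUE/FALSE combination and add a percentage column
na.table <- as.data.frame(table(na.flags))
na.table$Freq_pct <- 100 * na.table$Freq / sum(na.table$Freq)
na.table
```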